Intro

Let’s start by reading in the packages we’ll need, setting the working directory and reading in the red wine data.

Examining the red wine data

Let’s look at the first few entries to see what the data looks like, and see how many samples of the data we have.

head(reds)
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
dim(reds)
## [1] 1599   13

Univariate analysis

Bar plot of number of wine samples in each quality category

The first thing is to look at the distribution of wines vs. quality. We see a roughly normal distribution, perhaps slightly skewed. In particular, there are very few wines in the highest quality bins. This makes sense, since high-quality wines are relatively rare and difficult to create.
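
The plotting code isn’t shown in this report, so here is a minimal sketch of how the bar plot could be reproduced. It recovers the per-bin counts from the prior probabilities printed in the LDA section (prior times the 1599 samples); with the data loaded, something like barplot(table(reds$quality)) would presumably do the same job.

```r
# Prior probabilities of each quality bin, copied from the LDA output
# in this report. Multiplying by the 1599 samples recovers the counts.
priors <- c("3" = 0.006253909, "4" = 0.033145716, "5" = 0.425891182,
            "6" = 0.398999375, "7" = 0.124452783, "8" = 0.011257036)

quality_counts <- round(priors * 1599)
quality_counts

# Bar plot of the number of wines in each quality bin
barplot(quality_counts,
        xlab = "Quality", ylab = "Number of wines",
        main = "Red wines per quality bin")
```

Bins 5 and 6 together hold more than 80% of the samples, which is worth keeping in mind for every boxplot that follows.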

Feature boxplots and distributions

Next, we’ll look at the distributions for each of the chemical properties of wine, first for all wines, and then in a boxplot sorted by quality. This should give us a good idea of how each property is distributed overall and how this distribution varies with wine quality.
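
Since the plots themselves are not reproduced here, this is a rough sketch of the two views used for each property. The helper plot_property is hypothetical (it is not the code that produced the original figures), and the toy data frame holds only the six rows printed by head(reds), so the shapes it draws mean nothing statistically.

```r
# Toy data: two columns from the six rows printed by head(reds) above.
toy <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  quality          = c(5, 5, 5, 6, 5, 5)
)

# Hypothetical helper: histogram of one property next to a boxplot of
# the same property split by quality bin.
plot_property <- function(df, property) {
  old_par <- par(mfrow = c(1, 2))
  on.exit(par(old_par))
  hist(df[[property]], main = property, xlab = property)
  boxplot(df[[property]] ~ factor(df$quality),
          xlab = "quality", ylab = property)
  # Return the per-bin medians so the trend can also be read as numbers
  invisible(tapply(df[[property]], df$quality, median))
}

medians <- plot_property(toy, "volatile.acidity")
medians
```

Calling the same helper on the full reds data frame, once per column, would give the pairs of plots discussed below.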

Fixed acidity

The median of the fixed acidity increases with wine quality, though there are a number of outliers with large fixed acidity in the middle quality bins. Of course, there are a lot more samples in those bins.

The distribution is unimodal and roughly normal, but with a fatter tail on the high side.

Volatile acidity

Here, the relationship is pretty clear. The low quality wines have a high median volatile acidity, whereas high-quality wines have much less.
Volatile acidity measures acetic acid (vinegar) and other impurities, so this relationship makes sense.

The distribution is not far from normal. It could be a noisy normal or a bimodal distribution.

Citric acid

Median citric acid increases with wine quality, although there seem to be a bunch of outliers in quality bin 7 (all zero). Possible measurement error?

Citric acid is also the one distribution that is clearly non-normal–it’s multi-modal and skewed toward the low side.

Residual sugar

Residual sugar seems to be roughly equivalent for the different quality bins, although far outliers in the middle bins make the scale hard to read.

The overall distribution shows that values are normal around the mean/median, but with far outliers on the high side.

Chlorides

There appears to be a trend of decreasing chlorides in the higher quality bins, but this is a bit obscured by a number of far outliers that affect the scale.

As with residual sugar, the overall distribution for chlorides is normal around the mean/median, with far outliers on the high side.

Free sulfur dioxide

Median largest in the mid-quality wines, lower at each extreme. Outliers less extreme than the previous two.

The overall distribution is unimodal, but somewhat asymmetric.

Total sulfur dioxide

Total sulfur dioxide has a similar profile to free sulfur dioxide. Might be good to look for a correlation here.

The overall distribution is similar to free sulfur dioxide.

Density

Median density declines with wine quality, especially in the highest bins.

The overall distribution of density is roughly normal, but with fatter tails.

pH

Median pH also declines with wine quality, but the ranges have a lot of overlap.

The overall distribution of pH is similar to density (normal w/fat tails) but a bit noisier.

Sulphates

Just when you were getting bored, here’s another clear relationship. Median sulphates increase with increasing quality.

The overall distribution is normal-ish, with a long tail on the high end.

Alcohol

Low quality wines have a relatively low alcohol content, but this goes up in bins 6 and above. A low alcohol level could be a symptom of wine going to vinegar, couldn’t it (although there could be other factors)? Good to check for correlation to volatile acidity.

Overall distribution is unimodal, but with a giant, noisy tail on the high end.

Correlations of individual properties w/ quality

As a final step in our univariate analysis, let’s look at how strongly each property is correlated to wine quality:

##        fixed.acidity     volatile.acidity          citric.acid 
##           0.12405165          -0.39055778           0.22637251 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##           0.01373164          -0.12890656          -0.05065606 
## total.sulfur.dioxide              density                   pH 
##          -0.18510029          -0.17491923          -0.05773139 
##            sulphates              alcohol              quality 
##           0.25139708           0.47616632           1.00000000

This shows the strongest positive correlation to alcohol content, and the strongest negative correlation to volatile acidity, which makes sense.

Other factors with relatively high correlations to quality (+ or -) are sulphates and citric acid.
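
To make that ranking explicit, here is a small sketch that sorts the correlations printed above by absolute value. The numbers are copied from the output, and the variable names cor_quality and ranked are mine.

```r
# Correlations with quality, copied from the output above
# (the quality-vs-quality entry of 1 is left out).
cor_quality <- c(
  fixed.acidity = 0.12405165, volatile.acidity = -0.39055778,
  citric.acid = 0.22637251, residual.sugar = 0.01373164,
  chlorides = -0.12890656, free.sulfur.dioxide = -0.05065606,
  total.sulfur.dioxide = -0.18510029, density = -0.17491923,
  pH = -0.05773139, sulphates = 0.25139708, alcohol = 0.47616632
)

# Rank by strength of the relationship, ignoring its sign
ranked <- cor_quality[order(abs(cor_quality), decreasing = TRUE)]
round(ranked, 3)
```

Even the strongest of these is below 0.5 in absolute value, so no single property explains quality on its own.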

Multivariate Analysis

Now that we have looked at all of the properties individually and inspected how they vary in the different quality bins, it’s time to look at how different combinations of properties might affect the quality of a wine.

Correlation heatmap of properties

The first thing that popped out at me is that many of these properties are related, just because of chemistry. So, I did a correlation heatmap to find variables that are not completely independent.

Here it is:

Some things to notice here:

  • Fixed acidity and pH appear to be related, as do citric acid and pH. (Chemistry!)
  • As I speculated above, free sulfur dioxide and total sulfur dioxide are related. Again, probably Chemistry at work.
  • Density is inversely related to alcohol content as you would expect since alcohol is less dense than water. It is also related to fixed acidity, but I’m not sure that’s a causal relationship.

So, the conclusion here is that many of the properties give redundant information. Maybe we can decrease the dimensionality somehow.
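
The heatmap code is not shown, so here is one way it could be sketched in base R. The toy data frame uses four columns from the six rows printed by head(reds), which is far too few to trust the values, and the 0.8 cutoff for a "redundant" pair is an arbitrary assumption of mine.

```r
# Toy data: four columns from the six rows printed by head(reds).
toy <- data.frame(
  fixed.acidity        = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40),
  pH                   = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51)
)

# Pairwise Pearson correlations between the properties
cm <- cor(toy)

# Draw the matrix as a heatmap, without reordering or rescaling
heatmap(cm, Rowv = NA, Colv = NA, symm = TRUE, scale = "none",
        margins = c(10, 10))

# List the pairs above the (assumed) cutoff, each pair once
high <- which(abs(cm) > 0.8 & upper.tri(cm), arr.ind = TRUE)
data.frame(var1 = rownames(cm)[high[, "row"]],
           var2 = colnames(cm)[high[, "col"]],
           r    = cm[high])
```

On the full data the same call would be cor(reds[, 2:12]), assuming the property columns sit in positions 2 to 12 as in the PCA code below.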

Note: I’m sorry the graph is so scrunched in the PDF. It looks great on a large monitor.

Scatterplots

The next step is to look at two-dimensional scatterplots and see the relationships between some of the properties that seem most important in determining the quality of a red wine.

Let’s look at a scatterplot matrix of some of the more significant features.

Anyhow, this doesn’t show us too much more than we saw in the heatmap. Along the diagonal you can see that the distribution of most of the properties is roughly normal (Gaussian). Citric acid is a notable exception.

This may affect the ways we choose to analyze the data.

Other interesting bivariate relationships

Volatile acidity v. alcohol

Volatile acidity v. alcohol shows clear trends: higher quality wines have higher alcohol and lower volatile acidity.

Volatile acidity v. alcohol

This graph is interesting in that it shows a marked clustering of wines in quality bin 5 at low alcohol and high volatile acidity. Bin 5 is the lowest quality bin with a significant number of samples in it.

Sulfur dioxide v. volatile acidity

This looks like one of the better scatterplots for separating high quality from low quality wines. Again, we see a bunch of bin 5 wines clustered at low alcohol.

PCA for red wine

Since most of the properties are normally distributed, I decided to try a principal components analysis to reduce the dimensionality of the wine dataset.

I got some code from the interwebs for this one (can’t find the exact reference).

But the code and results seem reasonable, so here goes.

The first step is to rescale the data (center each property to mean 0 and scale it to standard deviation 1), then run the PCA.

  wine <- reds

  s <- as.data.frame(scale(wine[2:12]))
  wine.pca <- prcomp(s) 
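
As a quick check of what scale() does to each column, here is a sketch using the six alcohol values printed by head(reds). The names z_manual and z_scale are mine.

```r
# Alcohol values from the six rows printed by head(reds)
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4)

# Standardize by hand: subtract the mean, divide by the standard deviation
z_manual <- (alcohol - mean(alcohol)) / sd(alcohol)

# scale() does the same thing column by column
z_scale <- as.numeric(scale(alcohol))

cbind(alcohol, z_manual, z_scale)
```

Without this step, properties measured on large scales (total sulfur dioxide, for example) would dominate the principal components simply because of their units.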

Here is a summary of the results:

  summary(wine.pca)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.7604 1.3878 1.2452 1.1015 0.97943 0.81216 0.76406
## Proportion of Variance 0.2817 0.1751 0.1410 0.1103 0.08721 0.05996 0.05307
## Cumulative Proportion  0.2817 0.4568 0.5978 0.7081 0.79528 0.85525 0.90832
##                            PC8     PC9    PC10    PC11
## Standard deviation     0.65035 0.58706 0.42583 0.24405
## Proportion of Variance 0.03845 0.03133 0.01648 0.00541
## Cumulative Proportion  0.94677 0.97810 0.99459 1.00000

And a screeplot, which shows how much of the variance each of the new components accounts for.

  screeplot(wine.pca, type="lines")

There isn’t a real cutoff in the screeplot, but it is clear that the first 4-5 principal components account for most of the variance (about 71% and 80% cumulatively). So let’s have a look at them.
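
The variance rows in the summary follow directly from the standard deviations, which this sketch recomputes from the printed values. The 80% threshold used at the end is only an illustrative cutoff, not something the analysis depends on.

```r
# Standard deviations copied from the summary(wine.pca) output above
sdev <- c(1.7604, 1.3878, 1.2452, 1.1015, 0.97943, 0.81216,
          0.76406, 0.65035, 0.58706, 0.42583, 0.24405)
names(sdev) <- paste0("PC", seq_along(sdev))

variance <- sdev^2                     # variance captured by each component
prop_var <- variance / sum(variance)   # the "Proportion of Variance" row
cum_prop <- cumsum(prop_var)           # the "Cumulative Proportion" row

round(cum_prop, 4)

# Smallest number of components that reaches the assumed 80% threshold
n_keep <- which(cum_prop >= 0.80)[1]
n_keep
```

The variances sum to about 11 because each of the 11 standardized properties contributes a variance of 1.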

First principal axis

The first PA seems to relate to general acidity. It has a weakly positive relationship to quality.

  wine.pca$rotation[,1]
##        fixed.acidity     volatile.acidity          citric.acid 
##           0.48931422          -0.23858436           0.46363166 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##           0.14610715           0.21224658          -0.03615752 
## total.sulfur.dioxide              density                   pH 
##           0.02357485           0.39535301          -0.43851962 
##            sulphates              alcohol 
##           0.24292133          -0.11323206
  first_pa <- wine.pca$x[, 1]
  scatterplot(first_pa ~ wine$quality, 
              xlab="Wine quality", 
              ylab="First PA", 
              main="Wine quality vs first PA (axis of acidity)", 
              labels=row.names(wine))

  lm(first_pa~wine$quality)
## 
## Call:
## lm(formula = first_pa ~ wine$quality)
## 
## Coefficients:
##  (Intercept)  wine$quality  
##      -1.3558        0.2406
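
To show how the loadings in wine.pca$rotation turn into the scores in wine.pca$x, here is a sketch on toy data: three columns from the six rows printed by head(reds). The names s_toy, toy_pca and manual_scores are mine, and the numbers themselves are meaningless with so few rows.

```r
# Toy data: three columns from the six rows printed by head(reds).
toy <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  pH            = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4)
)

s_toy   <- scale(toy)      # mean 0, standard deviation 1 per column
toy_pca <- prcomp(s_toy)

# A score is the standardized sample projected onto a loading vector
manual_scores <- s_toy %*% toy_pca$rotation

# First-component score of every sample, computed both ways
cbind(manual = manual_scores[, 1], prcomp = toy_pca$x[, 1])
```

So a property with a large positive loading pushes a wine’s score up when the wine is above average in that property, and a negative loading pushes it down.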

Second principal axis

The second PA has high sulfur dioxide, high volatile acids and low alcohol (yuk). It falls off substantially in the higher-quality wines.

  wine.pca$rotation[,2]
##        fixed.acidity     volatile.acidity          citric.acid 
##         -0.110502738          0.274930480         -0.151791356 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##          0.272080238          0.148051555          0.513566812 
## total.sulfur.dioxide              density                   pH 
##          0.569486959          0.233575490          0.006710793 
##            sulphates              alcohol 
##         -0.037553916         -0.386180959
  second_pa <- wine.pca$x[, 2]
  scatterplot(second_pa ~ wine$quality, 
              xlab="Wine quality", 
              ylab="Second PA", 
              main="Wine quality vs second PA (axis of funk)", 
              labels=row.names(wine))

  lm(second_pa~wine$quality)
## 
## Call:
## lm(formula = second_pa ~ wine$quality)
## 
## Coefficients:
##  (Intercept)  wine$quality  
##       3.7463       -0.6647

Third principal axis

The third PA is characterized by high volatile acidity and low alcohol. It also has low sulfur dioxide, although the meaning of this is less clear. It is inversely related to wine quality. Basically, vinegar.

  third_pa <- wine.pca$x[, 3]
  wine.pca$rotation[,3]
##        fixed.acidity     volatile.acidity          citric.acid 
##           0.12330157           0.44996253          -0.23824707 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##          -0.10128338           0.09261383          -0.42879287 
## total.sulfur.dioxide              density                   pH 
##          -0.32241450           0.33887135          -0.05769735 
##            sulphates              alcohol 
##          -0.27978615          -0.47167322
  lm(third_pa~wine$quality)
## 
## Call:
## lm(formula = third_pa ~ wine$quality)
## 
## Coefficients:
##  (Intercept)  wine$quality  
##       3.4698       -0.6156
  scatterplot(third_pa ~ wine$quality, 
              xlab="Wine quality", 
              ylab="Third PA ", 
              main="Wine quality vs third PA (axis of vinegar)")

Fourth principal axis

The fourth PA is very boring. Its largest loadings are chlorides and sulphates, but it has no noticeable effect on quality.

  fourth_pa <- wine.pca$x[, 4]
  wine.pca$rotation[,4]
##        fixed.acidity     volatile.acidity          citric.acid 
##         -0.229617370          0.078959783         -0.079418256 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##         -0.372792562          0.666194756         -0.043537818 
## total.sulfur.dioxide              density                   pH 
##         -0.034577115         -0.174499758         -0.003787746 
##            sulphates              alcohol 
##          0.550872362         -0.122181088
  scatterplot(fourth_pa ~ wine$quality, 
              xlab="Wine quality", 
              ylab="Fourth PA ", 
              main="Wine quality vs fourth PA (axis of nothingburger)")

  lm(fourth_pa~wine$quality)
## 
## Call:
## lm(formula = fourth_pa ~ wine$quality)
## 
## Coefficients:
##  (Intercept)  wine$quality  
##      0.33946      -0.06023
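
To compare the four fits side by side, here is a small sketch that plugs the printed lm coefficients into the fitted line at the lowest and highest quality bins (3 and 8). It treats quality as a plain number, as the lm calls above do, and the column names are mine.

```r
# Intercepts and slopes copied from the four lm() outputs above
coefs <- data.frame(
  component = c("PA1", "PA2", "PA3", "PA4"),
  intercept = c(-1.3558, 3.7463, 3.4698, 0.33946),
  slope     = c(0.2406, -0.6647, -0.6156, -0.06023)
)

# Fitted score at the lowest (3) and highest (8) quality bins
coefs$fit_q3 <- coefs$intercept + coefs$slope * 3
coefs$fit_q8 <- coefs$intercept + coefs$slope * 8

# Total change in the fitted score across the quality range
coefs$change <- coefs$fit_q8 - coefs$fit_q3
coefs
```

The second and third PAs move by roughly three units across the quality range, while the fourth moves by about a third of a unit.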

Fifth principal axis

We’ll look at one more PA. This one seems to be characterized mostly by a lack of residual sugar, and, again, the effect on quality is minor.

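  # Column 5 holds the fifth component. The lm() output shown below was
  # produced with column 4, so it repeats the fourth PA's coefficients.
  fifth_pa <- wine.pca$x[, 5]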
  wine.pca$rotation[,5]
##        fixed.acidity     volatile.acidity          citric.acid 
##           0.08261366          -0.21873452           0.05857268 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##          -0.73214429          -0.24650090           0.15915198 
## total.sulfur.dioxide              density                   pH 
##           0.22246456          -0.15707671          -0.26752977 
##            sulphates              alcohol 
##          -0.22596222          -0.35068141
  scatterplot(fifth_pa ~ wine$quality, 
              xlab="Wine quality", 
              ylab="Fifth PA ", 
              main="Wine quality vs fifth PA (axis of dryness)")

  lm(fifth_pa~wine$quality)
## 
## Call:
## lm(formula = fifth_pa ~ wine$quality)
## 
## Coefficients:
##  (Intercept)  wine$quality  
##      0.33946      -0.06023

LDA for red wine

Next is a linear discriminant analysis (LDA) for the wine data. It should show the main axis that determines quality as a function of all the other properties.

It looks like the LD1 accounts for most of the quality variation, but the scatterplot shows that there is too much overlap in the distributions to be able to reliably sort out any but the highest and lowest quality wines.

Although there is overlap in the distributions, this does look like a reasonable measure of quality.

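One problem: there seems to be an anomaly in the density, whose coefficients are orders of magnitude larger than the others. The likely cause is scale: the model below is fit on the raw wine columns rather than on the standardized copy s, and density barely varies (it stays very close to 1), so it needs a huge coefficient to contribute at all.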
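
One possible reading of the density anomaly is that it is only a matter of units. The sketch below illustrates the effect on R’s built-in iris data, used here purely as a stand-in for the wine data: shrinking one predictor by a factor of 100 inflates its LDA coefficient by the same factor and leaves the other coefficient alone.

```r
library(MASS)

# Fit an LDA on the raw iris measurements (stand-in data, not wine)
fit_raw <- lda(Species ~ Sepal.Length + Petal.Length, data = iris)

# Same data, but with one predictor squeezed into a 100x smaller range
iris_small <- transform(iris, Sepal.Length = Sepal.Length / 100)
fit_small  <- lda(Species ~ Sepal.Length + Petal.Length, data = iris_small)

# The coefficient of the rescaled predictor grows by the same factor
ratio <- fit_small$scaling["Sepal.Length", "LD1"] /
         fit_raw$scaling["Sepal.Length", "LD1"]
abs(ratio)

# The coefficient of the untouched predictor stays where it was
cbind(raw   = fit_raw$scaling[, "LD1"],
      small = fit_small$scaling[, "LD1"])
```

If that is what is going on here, fitting the model on standardized columns should bring the density coefficients into line with the rest without changing the discriminant scores.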

  wine <- reds
  library('MASS')

  wine_features <- wine
  wine_features$quality <- NULL


  s <- as.data.frame(scale(wine_features))
  wine.lda <- lda(wine$quality ~ wine$fixed.acidity + wine$volatile.acidity +
                    wine$citric.acid + wine$residual.sugar + wine$chlorides +
                    wine$free.sulfur.dioxide + wine$total.sulfur.dioxide +
                    wine$density + wine$pH + wine$sulphates + wine$alcohol)
  wine.lda
## Call:
## lda(wine$quality ~ wine$fixed.acidity + wine$volatile.acidity + 
##     wine$citric.acid + wine$residual.sugar + wine$chlorides + 
##     wine$free.sulfur.dioxide + wine$total.sulfur.dioxide + wine$density + 
##     wine$pH + wine$sulphates + wine$alcohol)
## 
## Prior probabilities of groups:
##           3           4           5           6           7           8 
## 0.006253909 0.033145716 0.425891182 0.398999375 0.124452783 0.011257036 
## 
## Group means:
##   wine$fixed.acidity wine$volatile.acidity wine$citric.acid
## 3           8.360000             0.8845000        0.1710000
## 4           7.779245             0.6939623        0.1741509
## 5           8.167254             0.5770411        0.2436858
## 6           8.347179             0.4974843        0.2738245
## 7           8.872362             0.4039196        0.3751759
## 8           8.566667             0.4233333        0.3911111
##   wine$residual.sugar wine$chlorides wine$free.sulfur.dioxide
## 3            2.635000     0.12250000                 11.00000
## 4            2.694340     0.09067925                 12.26415
## 5            2.528855     0.09273568                 16.98385
## 6            2.477194     0.08495611                 15.71160
## 7            2.720603     0.07658794                 14.04523
## 8            2.577778     0.06844444                 13.27778
##   wine$total.sulfur.dioxide wine$density  wine$pH wine$sulphates
## 3                  24.90000    0.9974640 3.398000      0.5700000
## 4                  36.24528    0.9965425 3.381509      0.5964151
## 5                  56.51395    0.9971036 3.304949      0.6209692
## 6                  40.86991    0.9966151 3.318072      0.6753292
## 7                  35.02010    0.9961043 3.290754      0.7412563
## 8                  33.44444    0.9952122 3.267222      0.7677778
##   wine$alcohol
## 3     9.955000
## 4    10.265094
## 5     9.899706
## 6    10.629519
## 7    11.465913
## 8    12.094444
## 
## Coefficients of linear discriminants:
##                                     LD1           LD2          LD3
## wine$fixed.acidity           0.15576218  -0.510826253  -0.13230726
## wine$volatile.acidity       -2.14869965  -5.169157664  -2.80464132
## wine$citric.acid            -0.24353923  -1.810902037  -3.67023468
## wine$residual.sugar          0.09907188  -0.310654752  -0.27785760
## wine$chlorides              -4.49075830  -3.286220068   4.88913726
## wine$free.sulfur.dioxide     0.01015280   0.002518588   0.05746815
## wine$total.sulfur.dioxide   -0.01066123   0.015340541  -0.02412087
## wine$density              -132.46861030 494.715527325 442.83751116
## wine$pH                     -0.27624041  -4.797254644   0.75289349
## wine$sulphates               2.55180806  -0.768377584  -0.57558078
## wine$alcohol                 0.67697595   0.270197247   0.18108052
##                                     LD4           LD5
## wine$fixed.acidity         -1.151995674  1.826028e-01
## wine$volatile.acidity       2.625991390 -2.404376e+00
## wine$citric.acid            1.097971759 -2.639100e+00
## wine$residual.sugar        -0.399628931  4.467281e-01
## wine$chlorides             -8.619928322 -7.425094e+00
## wine$free.sulfur.dioxide    0.020697814 -5.792837e-02
## wine$total.sulfur.dioxide  -0.009733169  7.784032e-03
## wine$density              569.905215399 -4.286164e+02
## wine$pH                    -8.470107031  2.487246e+00
## wine$sulphates              0.055588624  8.097894e-01
## wine$alcohol                0.800344973 -5.659672e-01
## 
## Proportion of trace:
##    LD1    LD2    LD3    LD4    LD5 
## 0.8496 0.1028 0.0333 0.0086 0.0056
  # Project the training samples onto the discriminant axes.
  # predict() with no newdata reuses the data the model was fit on.
  wine.lda.values <- predict(wine.lda)
  first_lda <- wine.lda.values$x[, 1]
  scatterplot(wine$quality, first_lda)

  # ldahist(data = wine.lda.values$x[,1], g=wine$quality)
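
As a small check on the claim that LD1 carries most of the separation, this sketch accumulates the "Proportion of trace" values printed in the LDA output. The variable names are mine.

```r
# "Proportion of trace" values copied from the LDA output above
trace_prop <- c(LD1 = 0.8496, LD2 = 0.1028, LD3 = 0.0333,
                LD4 = 0.0086, LD5 = 0.0056)

# Share of the between-group separation covered by the first k axes
cum_trace <- cumsum(trace_prop)
round(cum_trace, 4)
```

LD1 alone covers about 85% of the between-group separation, and the first two axes together cover about 95%.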

Final Plots and Summary

In the end, if the PCA has some validity, you can see that two of the principal differences in wine were less important in determining quality: the general acidity and the residual sugar.

But two others (PA2 and PA3) seem to be signatures of problems in winemaking. PA3, with its high volatile acidity and low alcohol, is probably related to wine going to vinegar.

PA2 has some volatile acidity and low alcohol, but is mainly distinguished by a high total sulfur dioxide. This can also lend an off-taste to wine.

##        fixed.acidity     volatile.acidity          citric.acid 
##         -0.110502738          0.274930480         -0.151791356 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##          0.272080238          0.148051555          0.513566812 
## total.sulfur.dioxide              density                   pH 
##          0.569486959          0.233575490          0.006710793 
##            sulphates              alcohol 
##         -0.037553916         -0.386180959

## 
## Call:
## lm(formula = second_pa ~ wine$quality)
## 
## Coefficients:
##  (Intercept)  wine$quality  
##       3.7463       -0.6647

If you look at a plot of alcohol vs. total sulfur dioxide, you see an interesting cluster of wines of quality 5 at low alcohol and relatively high sulfur dioxide. I must say that I have no idea what this means.

Here is a useful summary of wine faults: https://wine.appstate.edu/sites/wine.appstate.edu/files/Chart%20Aromas%20FH_0.pdf

Reflection

Tolstoy said that all happy families are alike, but each unhappy family is unhappy in its own way. Perhaps not true for families, but true enough for wine, at least for the wine in the top bins vs. the wines in the middle and lower bins.

What distinguishes the higher quality wine is the absence of wine faults. It is fairly easy to distinguish the best wine from the others by the absence of these faults. The PCA and LDA show some possible combinations of features that could be a signature of wine faults, but of course it’s just exploratory.

As far as the wines in the middle quality bins (that is, the overwhelming majority of the wine samples), the picture becomes more hazy because there are many types of wine fault, so wines can be less-than-perfect in many different ways, in different degrees and different combinations. I’m not sure if the mid-quality wines are blended, in which case you’d expect wines with complementary faults to be mixed together, which would muddy the water still more.

Anyhow, nice project, more fun than I thought.